Compiling Reports from R Scripts

Assignment 11 - My Draft Report

Partial Functional Variation and Molecular Evolution of Ribosomal Proteins

1. Ribosomal proteins as part of the ribosome complex

Figure: ribosomal protein(s) forming part of the ribosome complex.

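2. Background

  • ○ The ribosome:

    • Polymerizes amino acids to synthesize proteins

    • Universally present in the cells of living organisms

  • ○ Ribosomal proteins:

    • Components of the ribosome, which consists of one large (50S) and one small (30S) subunit

  • ○ S7:

    • Interaction with tRNA rather than with 16S rRNA

      Figure: the flexible β-arm and α6, fully exposed to the solvent.
  • ○ L2:

    • Has two nucleic acid-binding motifs

    • Suggests molecular evolution among nucleic acid-binding proteins

      Figure: structures of L2 from Bacillus and eIF-5A from Methanococcus; the C-terminal domain resembles an SH3-like barrel and the N-terminal domain an OB-fold.
  • ○ L5:

    • Has an RRM (RNA Recognition Motif)

      Figure: proteins carrying an RRM (involved in RNA recognition) similar to that of L5.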

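3. Target Organisms and Methods

Target organisms:

  • Bacteria:
    • Model organisms:
      • Escherichia coli
      • Bacillus subtilis
      • Cyanobacteria
      • Thermus thermophilus
    • Non-model organisms:
      • Geobacillus stearothermophilus (formerly Bacillus stearothermophilus)
  • Archaea:
    • Haloarcula marismortui
    • Methanocaldococcus jannaschii (formerly Methanococcus jannaschii)

Methods:

  • Download the amino acid sequences of the target organisms' proteins from NCBI
  • Run a multiple sequence alignment on the amino acid sequences
  • Draw phylogenetic trees
  • Check the three-dimensional protein structures in the PDB
  • Map the mutated amino acids onto the structures with Chimera
  • Consider the functional changes that the mutations cause
  • Discuss molecular evolution

Environment: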

huang@MuyangnoMBP ~ % uname -a
Darwin MuyangnoMBP 19.4.0 Darwin Kernel Version 19.4.0: Wed Mar  4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64 x86_64

huang@MuyangnoMBP ~ % sw_vers
ProductName:    Mac OS X
ProductVersion: 10.15.4
BuildVersion:   19E287

R version 4.0.0 (2020-04-24) -- "Arbor Day"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin17.0 (64-bit)
[R.app GUI 1.71 (7827) x86_64-apple-darwin17.0]
R.app GUI 1.71 (7827 Catalina build), S. Urbanek & H.-J. Bibiko, © R Foundation for Statistical Computing, 2016
RStudio 1.2.5042, © 2009-2020 RStudio, Inc.

Chimera-1.14-mac64, Nov. 13, 2019,  © 2019 Regents of the University of California.  All Rights Reserved.

Chunk options

Retrieving genome sequence data using SeqinR.

Retrieve the amino acid sequence data using the R package SeqinR.

Load seqinr and the other packages:

# load the R packages
library(seqinr)
library(Biostrings)
library(msa)
library(ape)
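The report does not show how `seqs_S7` and `seqnames_S7` (used in the `write.fasta()` calls) were created. As a hedged sketch only, one possible way to obtain comparable input is to download each entry as FASTA by its accession number. The helper names `uniprot_fasta_url` and `fetch_uniprot_fasta` are hypothetical, the URL pattern is an assumption about the UniProt website, and the original analysis may well have used a different route, such as SeqinR's own database queries.

```r
# Hypothetical helper: build a FASTA download URL for each accession.
# The URL pattern is an assumption, not taken from the report.
uniprot_fasta_url <- function(accessions) {
  paste0("https://rest.uniprot.org/uniprotkb/", accessions, ".fasta")
}

# Hypothetical helper: download the entries and return the sequences as a
# named list of character vectors (one residue per element).
# It needs network access, so it is defined here but not called.
fetch_uniprot_fasta <- function(accessions) {
  seqs <- lapply(uniprot_fasta_url(accessions), function(u) {
    lines <- readLines(u)                        # header line + sequence lines
    residues <- paste(lines[-1], collapse = "")  # drop the ">" header line
    strsplit(residues, "")[[1]]
  })
  names(seqs) <- accessions
  seqs
}

# S7 accessions that appear in the alignment further down
seqnames_S7_example <- c("P02359", "P21469", "P17291",
                         "P22744", "P32552", "P54063")
print(uniprot_fasta_url(seqnames_S7_example))

# With network access, something like this could feed write.fasta():
# seqs_S7_example <- fetch_uniprot_fasta(seqnames_S7_example)
```

A list of residue vectors plus a vector of names is the shape that `write.fasta()` expects for its first two arguments.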

○ S7:

Answer the following questions. For each question, please record your answer, and what you typed into R to get this answer.

Q1. Calculate the genetic distances between > 3 protein sequences of interest. Which are the most closely related proteins, based on the genetic distances?

write out the sequences to a FASTA file

write.fasta(seqs_S7, seqnames_S7, file="myseq_Assignment_S7.fasta")

Read an XStringSet object from a file

mySequences_S7 <- readAAStringSet(file = "myseq_Assignment_S7.fasta")

Multiple Sequence Alignment using ClustalW

myAlignment_S7 <- msa(mySequences_S7, "ClustalW")
## use default substitution matrix
print(myAlignment_S7, show="complete")
## 
## MsaAAMultipleAlignment with 6 rows and 239 columns
##     aln (1..54)                                            names
## [1] ----------------------------------MPRKGPVAKRDVLPDPI--- P21469
## [2] ----------------------------------MPRRGPVAKRDVLPDPI--- P22744
## [3] ----------------------------------MPRRRVIGQRKILPDPK--- P02359
## [4] ----------------------------------MARRRRAEVRQLQPDLV--- P17291
## [5] MSAEDTPEADADAAEESEPETARAKLFGEWDITDIEYSDPSTERYITVTPI--A P32552
## [6] ------------------MELDEIKVFGRWSTKDVVVKDPGLRNYINLTPIYVP P54063
## Con ----------------------------------MPRR?P???R?ILPDPI--- Consensus 
## 
##     aln (55..108)                                          names
## [1] -------------YNSKLVSRLINKMMI----DGKKGKSQTILYKSFDIIKERT P21469
## [2] -------------YNSKLVTRLINKIMI----DGKKSKAQKILYTAFDIIRERT P22744
## [3] -------------FGSELLAKFVNILMV----DGKKSTAESIVYSALETLAQRS P02359
## [4] -------------YGDVLVTAFINKIMR----DGKKNLAARIFYDACKIIQEKT P17291
## [5] HTMGRHADKQFKKSEISIVERLINRLMQTDENTGKKQLATSIVTEAFELVHERT P32552
## [6] HTAGRYTKRQFEKAKMNIVERLVNKVMRREENTGKKLKALKIVENAFEIIEKRT P54063
## Con -------------Y?S?LV?RLINK?M?----DGKK?KA??IVY?AFEII?ERT Consensus 
## 
##     aln (109..162)                                         names
## [1] GNDAMEVFEQALKNIMPVLEVKARRVGGANYQVPVEVRPERRTTLGLRWLVN-- P21469
## [2] GKDPMEVFEQALKNVMPVLEVRARRVGGANYQVPVEVRPDRRVSLGLRWLVQ-- P22744
## [3] GKSELEAFEVALENVRPTVEVKSRRVGGSTYQVPVEVRPVRRNALAMRWIVE-- P02359
## [4] GQEPLKVFKQAVENVKPRMEVRSRRVGGANYQVPMEVSPRRQQSLALRWLVQ-- P17291
## [5] DENPIQVLVSAVENSAPREETVRLKYGGISVPKAVDVAPQRRVDQALKFLAEGV P32552
## [6] KQNPIQVLVDAIENAGPREDTTRISYGGIVYLQSVDCSSLRRIDVALRNIALGA P54063
## Con G??P?EVFEQALENV?PR?EV??RRVGGANYQVPVEVRP?RR??LALRWLV?-- Consensus 
## 
##     aln (163..216)                                         names
## [1] -YARLRGEKTMEERLANEILDAAN---NTGAAVKKREDTHKMAEANKAFAHYRW P21469
## [2] -YARLRNEKTMEERLANEIMDAAN---NTGAAVKKREDTHKMAEANKAFAHYRW P22744
## [3] -AARKRGDKSMALRLANELSDAAE---NKGTAVKKREDVHRMAEANKAFAHYRW P02359
## [4] -AANQRPERRAAVRIAHELMDAAE---GKGGAVKKKEDVERMAEANRAYAHYRW P17291
## [5] YGGSFKTTTTAAEALAQQLIGAANDDVQT-YAVNQKEEKERVAAAAR------- P32552
## [6] YMAAHKSKKPIEEALAEEIIAAARGDMQKSYAVRKKEETERVAQSAR------- P54063
## Con -?AR?R?EKTM?ERLANE??DAAN---N?G?AVKK?EDT?RMAEAN?AFAHYRW Consensus 
## 
##     aln (217..239)          names
## [1] ----------------------- P21469
## [2] ----------------------- P22744
## [3] LSLRSFSHQAGASSKQPALGYLN P02359
## [4] ----------------------- P17291
## [5] ----------------------- P32552
## [6] ----------------------- P54063
## Con ----------------------- Consensus

Figure: consensus sequence of S7, visualized with Chimera.

write an XStringSet object to a file

writeXStringSet(unmasked(myAlignment_S7), file = "myaln_Assignment_S7.fasta")

read the FASTA-format alignment into R

myaln_S7 <- read.alignment(file = "myaln_Assignment_S7.fasta", format = "fasta")

calculate the genetic distances between the protein sequences

mydist_S7 <- dist.alignment(myaln_S7)
mydist_S7
##           P21469    P22744    P02359    P17291    P32552
## P22744 0.1961161                                        
## P02359 0.4160251 0.4082483                              
## P17291 0.4311582 0.4082483 0.4236593                    
## P32552 0.5452498 0.5067117 0.5870218 0.5263336          
## P54063 0.5616371 0.5309230 0.6075587 0.5792844 0.4436311
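A note on reading these numbers: according to the seqinr documentation, `dist.alignment()` returns the square root of the pairwise distance, and for proteins its default `matrix = "similarity"` setting uses a similarity matrix rather than plain identity. The sketch below is an illustration of the simpler identity-based variant only, d = sqrt(1 - identity), so its values are not expected to reproduce the table above. The function name `identity_distance` and the rule of skipping gapped columns are assumptions made for this example.

```r
# Illustration only: identity-based distance between two aligned sequences,
# d = sqrt(1 - identity). Columns with a gap in either sequence are skipped
# (an assumption of this sketch).
identity_distance <- function(a, b, gap = "-") {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))      # aligned rows have equal length
  keep <- x != gap & y != gap            # compare ungapped columns only
  identity <- mean(x[keep] == y[keep])   # fraction of identical residues
  sqrt(1 - identity)
}

# Toy aligned pair (not taken from the data): 5 of 6 compared columns match
d <- identity_distance("MPRK-GP", "MPRRAGP")
print(d)

# Going the other way: squaring a reported distance undoes the square root
print(0.1961161^2)
```

Because of the square root, small distances in the table correspond to even smaller underlying dissimilarities.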

get sequence annotations

unlist(getAnnot(seqs_S7))
## [1] "sp|P02359|RS7_ECOLI 30S ribosomal protein S7 OS=Escherichia coli (strain K12) OX=83333 GN=rpsG PE=1 SV=3"                                                                  
## [2] "sp|P21469|RS7_BACSU 30S ribosomal protein S7 OS=Bacillus subtilis (strain 168) OX=224308 GN=rpsG PE=1 SV=4"                                                                
## [3] "sp|P17291|RS7_THET8 30S ribosomal protein S7 OS=Thermus thermophilus (strain HB8 / ATCC 27634 / DSM 579) OX=300852 GN=rpsG PE=1 SV=3"                                      
## [4] "sp|P22744|RS7_GEOSE 30S ribosomal protein S7 OS=Geobacillus stearothermophilus OX=1422 GN=rpsG PE=1 SV=3"                                                                  
## [5] "sp|P32552|RS7_HALMA 30S ribosomal protein S7 OS=Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809) OX=272569 GN=rps7 PE=1 SV=2"                 
## [6] "sp|P54063|RS7_METJA 30S ribosomal protein S7 OS=Methanocaldococcus jannaschii (strain ATCC 43067 / DSM 2661 / JAL-1 / JCM 10045 / NBRC 100440) OX=243232 GN=rps7 PE=3 SV=1"

The S7 proteins of Bacillus subtilis (P21469) and Geobacillus stearothermophilus (P22744) are the most closely related: their genetic distance (0.196) is the smallest in the matrix.
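The closest pair can also be picked out programmatically instead of by eye. The following base-R sketch re-enters the S7 distances printed above into a matrix (the variable names `ids`, `m` and `closest_pair` are introduced here for illustration) and looks up the smallest entry.

```r
# Pairwise distances copied from the mydist_S7 output above
ids <- c("P21469", "P22744", "P02359", "P17291", "P32552", "P54063")
m <- matrix(0, 6, 6, dimnames = list(ids, ids))
m[lower.tri(m)] <- c(
  0.1961161, 0.4160251, 0.4311582, 0.5452498, 0.5616371,  # vs P21469
  0.4082483, 0.4082483, 0.5067117, 0.5309230,             # vs P22744
  0.4236593, 0.5870218, 0.6075587,                        # vs P02359
  0.5263336, 0.5792844,                                   # vs P17291
  0.4436311                                               # vs P32552
)

# Keep only the lower triangle so each pair is considered once
m[upper.tri(m, diag = TRUE)] <- NA

# Locate the smallest distance and report the two sequences involved
idx <- which(m == min(m, na.rm = TRUE), arr.ind = TRUE)
closest_pair <- c(colnames(m)[idx[1, "col"]], rownames(m)[idx[1, "row"]])
print(closest_pair)
print(min(m, na.rm = TRUE))
```

With a `dist` object already in memory, `as.matrix(mydist_S7)` would give the same kind of matrix without retyping the values.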

Q2. Build an unrooted phylogenetic tree of the proteins, using the neighbour-joining algorithm. Which are the most closely related proteins, based on the tree?

# construct a phylogenetic tree with the neighbor joining algorithm
mytree_S7 <- nj(mydist_S7)
plot.phylo(mytree_S7, type="unrooted")

In the unrooted tree, the S7 proteins of Bacillus subtilis (P21469) and Geobacillus stearothermophilus (P22744) are again the most closely related.

Q3. Build a rooted phylogenetic tree of the proteins, using an outgroup. Which are the most closely related proteins, based on the tree? What extra information does this tree tell you, compared to the unrooted tree in Q2?

mytree_S7 <- root(mytree_S7, outgroup = "P54063", resolve.root = TRUE)
plot.phylo(mytree_S7, main = "Phylogenetic Tree")

In the rooted tree, the S7 proteins of Bacillus subtilis (P21469) and Geobacillus stearothermophilus (P22744) are still the most closely related. Compared with the unrooted tree, the rooted tree also shows the branching order relative to the outgroup: Escherichia coli (P02359) appears more closely related to Bacillus subtilis (P21469) and Geobacillus stearothermophilus (P22744) than Thermus thermophilus (P17291) is.
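The tree above is rooted on a single archaeal sequence (P54063). Since P32552 (Haloarcula marismortui) is archaeal as well, a possible alternative is to use both archaeal sequences as the outgroup. The sketch below is a suggestion, not what the report did: it rebuilds the S7 distance matrix from the printed values, requires the ape package, and introduces the names `archaea` and `rooted` for illustration.

```r
library(ape)

# Pairwise distances copied from the mydist_S7 output above
ids <- c("P21469", "P22744", "P02359", "P17291", "P32552", "P54063")
m <- matrix(0, 6, 6, dimnames = list(ids, ids))
m[lower.tri(m)] <- c(
  0.1961161, 0.4160251, 0.4311582, 0.5452498, 0.5616371,
  0.4082483, 0.4082483, 0.5067117, 0.5309230,
  0.4236593, 0.5870218, 0.6075587,
  0.5263336, 0.5792844,
  0.4436311
)
d <- as.dist(m)  # as.dist() reads the lower triangle

# Neighbour-joining tree, then root on both archaeal sequences together
tree <- nj(d)
archaea <- c("P32552", "P54063")
rooted <- root(tree, outgroup = archaea, resolve.root = TRUE)

# Sanity checks: the tree is rooted, and the two Bacillus-group
# sequences still form a clade of their own
print(is.rooted(rooted))
print(is.monophyletic(rooted, c("P21469", "P22744")))

plot.phylo(rooted, main = "S7, rooted on both archaeal sequences")
```

Rooting on a multi-sequence outgroup only works when those sequences group together in the unrooted tree, so it is worth checking the unrooted plot first.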

○ L2:

Answer the following questions. For each question, please record your answer, and what you typed into R to get this answer.

Q1. Calculate the genetic distances between > 3 protein sequences of interest. Which are the most closely related proteins, based on the genetic distances?

write out the sequences to a FASTA file

write.fasta(seqs_L2, seqnames_L2, file="myseq_Assignment_L2.fasta")

Read an XStringSet object from a file

mySequences_L2 <- readAAStringSet(file = "myseq_Assignment_L2.fasta")

Multiple Sequence Alignment using ClustalW

myAlignment_L2 <- msa(mySequences_L2, "ClustalW")
## use default substitution matrix
print(myAlignment_L2, show="complete")
## 
## MsaAAMultipleAlignment with 6 rows and 283 columns
##     aln (1..54)                                            names
## [1] MAIKKYKPTSNGRRGMTTSDFAEITTDKPEKSLLAPLHKKGGRNNQGKLTVRHQ P42919
## [2] MAIKKYKPTSNGRRGMTVLDFSEITTDQPEKSLLAPLKKRAGRNNQGKITVRHQ P04257
## [3] MAVKKFKPYTPSRRFMTVADFSEITKTEPEKSLVKPLKKTGGRNNQGRITVRFR P60405
## [4] MAVVKCKPTSPGRRHVVKVVNPELHKGKPFAPLLEKNSKSGGRNNNGRITTRHI P60422
## [5] ----------MGRR-------------------IQGQRRGRGTSTFRAPSHRYK P20276
## [6] ----------MGKR-------------------LISQRRGRGSSVYTCPSHKRR P54017
## Con MA?KK?KPTS?GRR?MT??DF?EIT???PEKSLL?PL?K?GGRNNQG?ITVRH? Consensus 
## 
##     aln (55..108)                                          names
## [1] GGGHKRQYRVIDFKR-DKDGIPGRVATVEYDPNRSANIALINYADGEKRYILAP P42919
## [2] GGGHKRQYRIIDFKR-DKDGIPGRVATIEYDPNRSANIALINYADGEKRYIIAP P04257
## [3] GGGHKRLYRIIDFKRWDKVGIPAKVAAIEYDPNRSARIALLHYVDGEKRYIIAP P60405
## [4] GGGHKQAYRIVDFKR-NKDGIPAVVERLEYDPNRSANIALVLYKDGERRYILAP P60422
## [5] ADLEHR---KVEDGD----VIAGTVVDIEHDPARSAPVAAVEFEDGDRRLILAP P20276
## [6] GEAKYRRFDELEKKG----KVLGKIVDILHDPGRSAPVAKVEYETGEEGLLVVP P54017
## Con GGGHKR?YRIIDFKR-DKDGIPG?VA?IEYDPNRSANIALV?Y?DGEKRYILAP Consensus 
## 
##     aln (109..162)                                         names
## [1] KGIQVGTEIMSGPEADIKVGNALPLINIPVGTVVHNIELKPGKGGQLVRSAGTS P42919
## [2] KNLKVGMEIMSGPDADIKIGNALPLENIPVGTLVHNIELKPGRGGQLVRAAGTS P04257
## [3] DGLQVGQQVVAGPDAPIQVGNALPLRFIPVGTVVHAVELEPKKGAKLARAAGTS P60405
## [4] KGLKAGDQIQSGVDAAIKPGNTLPMRNIPVGSTVHNVEMKPGKGGQLARSAGTY P60422
## [5] EGVGVGDELQVGVSAEIAPGNTLPLAEIPEGVPVCNVESSPGDGGKFARASGVN P20276
## [6] EGVKVGDIIECGVSAEIKPGNILPLGAIPEGIPVFNIETVPGDGGKLVRAGGCY P54017
## Con KGLKVGDEI?SG?DA?IKPGNALPL?NIPVGT?VHN?ELKPGKGG?L?RAAGTS Consensus 
## 
##     aln (163..216)                                         names
## [1] AQVLGKEGKYVLVRLNSGEVRMILSACRASIGQVGNEQHELINIGKAGRSRWKG P42919
## [2] AQVLGKEGKYVIVRLASGEVRMILGKCRATVGEVGNEQHELVNIGKAGRARWLG P04257
## [3] AQIQGREGDYVILRLPSGELRKVHGECYATVGAVGNADHKNIVLGKAGRSRWLG P60405
## [4] VQIVARDGAYVTLRLRSGEMRKVEADCRATLGEVGNAEHMLRVLGKAGAARWRG P60422
## [5] AQLLTHDRNVAVVKLPSGEMKRLDPQCRATIGVVAGGGRTDKPFVKAGNKHHKM P20276
## [6] AHILTHDGERTYVKLPSGHIKALHSMCRATIGVVAGGGRKEKPFVKAGKKYHAM P54017
## Con AQILG??G?YV?VRLPSGE?R?????CRATIG?VGN??H?L???GKAGR?RW?G Consensus 
## 
##     aln (217..270)                                         names
## [1] IR-----PTVRGSVMNPNDHPHGGGEGRAPIGRKSPMSPWGKPTLGFKTRKKKN P42919
## [2] IR-----PTVRGSVMNPVDHPHGGGEGKAPIGRKSPMTPWGKPTLGYKTRKKKN P04257
## [3] RR-----PHVRGAAMNPVDHPHGGGEGRAPRGR-PPASPWGWQTKGLKTRKRRK P60405
## [4] VR-----PTVRGTAMNPVDHPHGGGEGRN-FGK-HPVTPWGVQTKGKKTRSNKR P60422
## [5] KARGTKWPNVRGVAMNAVDHPFGGG------GRQHPGKPKSISRN-APPGRKVG P20276
## [6] KAKAVKWPRVRGVAMNAVDHPFGGG------RHQHTGKPTTVSRKKVPPGRKVG P54017
## Con ?R-----PTVRG?AMNPVDHPHGGGEGRA??GR?HP??PWG??TKG?KTRKKK? Consensus 
## 
##     aln (271..283)  names
## [1] KSDKFIVRRRKNK  P42919
## [2] KSDKFIIRRRKK-  P04257
## [3] PSSRFIIARRKK-  P60405
## [4] -TDKFIVRRRSK-  P60422
## [5] DIASKRTGRGGNE  P20276
## [6] HISARRTGVRK--  P54017
## Con ?SDKFI?RRRKK-  Consensus

Figure: consensus sequence of L2, visualized with Chimera.

write an XStringSet object to a file

writeXStringSet(unmasked(myAlignment_L2), file = "myaln_Assignment_L2.fasta")

read the FASTA-format alignment into R

myaln_L2 <- read.alignment(file = "myaln_Assignment_L2.fasta", format = "fasta")

calculate the genetic distances between the protein sequences

mydist_L2 <- dist.alignment(myaln_L2)
mydist_L2
##           P42919    P04257    P60405    P60422    P20276
## P04257 0.2170287                                        
## P60405 0.4045199 0.3813850                              
## P60422 0.4364358 0.4364358 0.4688072                    
## P20276 0.6084511 0.6132441 0.5859587 0.5872202          
## P54017 0.6230455 0.6297813 0.6243641 0.6222813 0.4150529

get sequence annotations

unlist(getAnnot(seqs_L2))
## [1] "sp|P60422|RL2_ECOLI 50S ribosomal protein L2 OS=Escherichia coli (strain K12) OX=83333 GN=rplB PE=1 SV=2"                                                                  
## [2] "sp|P42919|RL2_BACSU 50S ribosomal protein L2 OS=Bacillus subtilis (strain 168) OX=224308 GN=rplB PE=1 SV=3"                                                                
## [3] "sp|P60405|RL2_THET8 50S ribosomal protein L2 OS=Thermus thermophilus (strain HB8 / ATCC 27634 / DSM 579) OX=300852 GN=rplB PE=1 SV=3"                                      
## [4] "sp|P04257|RL2_GEOSE 50S ribosomal protein L2 OS=Geobacillus stearothermophilus OX=1422 GN=rplB PE=1 SV=2"                                                                  
## [5] "sp|P20276|RL2_HALMA 50S ribosomal protein L2 OS=Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809) OX=272569 GN=rpl2 PE=1 SV=4"                 
## [6] "sp|P54017|RL2_METJA 50S ribosomal protein L2 OS=Methanocaldococcus jannaschii (strain ATCC 43067 / DSM 2661 / JAL-1 / JCM 10045 / NBRC 100440) OX=243232 GN=rpl2 PE=3 SV=2"
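The trees below label their tips with accession numbers, so it helps to have an accession-to-species lookup. As a minimal sketch, the annotation strings can be parsed with base-R regular expressions. The two strings are copied from the output above; the names `annot`, `accession`, `organism` and `species_by_accession` are introduced here for illustration, and the patterns assume the `sp|ACCESSION|ID ... OS=organism OX=...` layout seen in that output.

```r
# Two annotation lines copied from the getAnnot() output above
annot <- c(
  "sp|P60422|RL2_ECOLI 50S ribosomal protein L2 OS=Escherichia coli (strain K12) OX=83333 GN=rplB PE=1 SV=2",
  "sp|P04257|RL2_GEOSE 50S ribosomal protein L2 OS=Geobacillus stearothermophilus OX=1422 GN=rplB PE=1 SV=2"
)

# Accession: the field between the first and second "|"
accession <- sub("^[^|]*\\|([^|]+)\\|.*$", "\\1", annot)

# Organism: the text after "OS=" up to " OX="
organism <- sub("^.*OS=(.*?) OX=.*$", "\\1", annot, perl = TRUE)

# Named vector: look up a species by accession number
species_by_accession <- setNames(organism, accession)
print(species_by_accession)

# Possible use once all six annotation lines are parsed (not run here):
# mytree_L2$tip.label <- unname(species_by_accession[mytree_L2$tip.label])
```

In the report itself the input would be `unlist(getAnnot(seqs_L2))` rather than retyped strings.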

The L2 proteins of Bacillus subtilis (P42919) and Geobacillus stearothermophilus (P04257) are the most closely related: their genetic distance (0.217) is the smallest in the matrix.

Q2. Build an unrooted phylogenetic tree of the proteins, using the neighbour-joining algorithm. Which are the most closely related proteins, based on the tree?

# construct a phylogenetic tree with the neighbor joining algorithm
mytree_L2 <- nj(mydist_L2)
plot.phylo(mytree_L2, type="unrooted")

In the unrooted tree, the L2 proteins of Bacillus subtilis (P42919) and Geobacillus stearothermophilus (P04257) are again the most closely related.

Q3. Build a rooted phylogenetic tree of the proteins, using an outgroup. Which are the most closely related proteins, based on the tree? What extra information does this tree tell you, compared to the unrooted tree in Q2?

mytree_L2 <- root(mytree_L2, outgroup = "P54017", resolve.root = TRUE)
plot.phylo(mytree_L2, main = "Phylogenetic Tree")

In the rooted tree, the L2 proteins of Bacillus subtilis (P42919) and Geobacillus stearothermophilus (P04257) are still the most closely related. Compared with the unrooted tree, the rooted tree also shows the branching order relative to the outgroup: Thermus thermophilus (P60405) appears more closely related to Bacillus subtilis (P42919) and Geobacillus stearothermophilus (P04257) than Escherichia coli (P60422) is.

○ L5:

Answer the following questions. For each question, please record your answer, and what you typed into R to get this answer.

Q1. Calculate the genetic distances between > 3 protein sequences of interest. Which are the most closely related proteins, based on the genetic distances?

write out the sequences to a FASTA file

write.fasta(seqs_L5, seqnames_L5, file="myseq_Assignment_L5.fasta")

Read an XStringSet object from a file

mySequences_L5 <- readAAStringSet(file = "myseq_Assignment_L5.fasta")

Multiple Sequence Alignment using ClustalW

myAlignment_L5 <- msa(mySequences_L5, "ClustalW")
## use default substitution matrix
print(myAlignment_L5, show="complete")
## 
## MsaAAMultipleAlignment with 6 rows and 213 columns
##     aln (1..54)                                            names
## [1] MNR---LKEKYNKEIAPALMTKFNYDSVMQVPKIEKIVINMGVGDAVQNAKAID P12877
## [2] MNR---LKEKYVKEVVPALMSKFNYKSIMQVPKIEKIVINMGVGDAVQNPKALD P08895
## [3] MPLDVALKRKYYEEVRPELIRRFGYQNVWEVPRLEKVVINQGLGEAKEDARILE P41201
## [4] MAK---LHDYYKDEVVKKLMTEFNYNSVMQVPRVEKITLNMGVGEAIADKKLLD P62399
## [5] ---------------MSSESESGGDFHEMREPRIEKVVVHMGIGHGGRD---LA P14124
## [6] ---------------MSFEELWQK--NPMLKPRIEKVVVNFGVGESGDR---LT P54040
## Con M??---LK?KY??EV?P?LM??FNY?SVMQVPRIEK?VINMGVGEA??D?K?LD Consensus 
## 
##     aln (55..108)                                          names
## [1] SAVEELTFIAGQKPVVTRAKKSIAGFRLREGMPIGAKVTLRGERMYDFLDKLIS P12877
## [2] SAVEELTLIAGQRPVVTRAKKSIAGFRLRQGMPIGAKVTLRGERMYEFLDKLIS P08895
## [3] KAAQELALITGQKPAVTRAKKSISNFKLRKGMPIGLRVTLRRDRMWIFLEKLLN P41201
## [4] NAAADLAAISGQKPLITKARKSVAGFKIRQGYPIGCKVTLRGERMWEFFERLIT P62399
## [5] NAEDILGEITGQMPVRTKAKRTVGEFDIREGDPIGAKVTLRDEMAEEFLQTALP P14124
## [6] KGAQVIEELTGQKPIRTRAKQTNPSFGIRKKLPIGLKVTLRGKKAEEFLKNAFE P54040
## Con ?AA?EL??ITGQKPVVTRAKKSIAGF??R?GMPIGAKVTLRGERM?EFL?KLI? Consensus 
## 
##     aln (109..162)                                         names
## [1] VSLPRVRDFRGVSKKSFDGRGNYTLGIKEQLIFPEIDYDKVTKVRGMDIVIVTT P12877
## [2] VSLPRVRDFRGVSKKAFDGRGNYTLGIKEQLIFPEIDYDKVNKVRGMDIVIVTT P08895
## [3] VALPRIRDFRGLNPNSFDGRGNYNLGLREQLIFPEITYDMVDALRGMDIAVVTT P41201
## [4] IAVPRIRDFRGLSAKSFDGRGNYSMGVREQIIFPEIDYDKVDRVRGLDITITTT P62399
## [5] LA--------ELATSQFDDTGNFSFGVEEHTEFPSQEYDPSIGIYGLDVTVNLV P14124
## [6] AFQ---KEGKKLYDYSFDDYGNFSFGIHEHIDFPGQKYDPMIGIFGMDVCVTLE P54040
## Con VALPR?RDFRGLS?KSFDGRGNYSLGI?EQLIFPEIDYDKV??VRGMDI??VTT Consensus 
## 
##     aln (163..213)                                      names
## [1] ANTDEEARELLTQVGMPFQK------------------------------- P12877
## [2] ANTDEEARELLALLGMPFQK------------------------------- P08895
## [3] AETDEEARALLELLGFPFRK------------------------------- P41201
## [4] AKSDEEGRALLAAFDFPFRK------------------------------- P62399
## [5] RPGYRVAKRDKASRSIPTKHRLNPADAVAFIESTYDVEVSE---------- P14124
## [6] RPGFRVKRRKRCRAKIPRRHRLTREEAIEFIEKTFGVKVERVLLEEEEETQ P54040
## Con A?TDEEAR?LLA??G?PFRK------------------------------- Consensus

Figure: consensus sequence of L5, visualized with Chimera.

write an XStringSet object to a file

writeXStringSet(unmasked(myAlignment_L5), file = "myaln_Assignment_L5.fasta")

read the FASTA-format alignment into R

myaln_L5 <- read.alignment(file = "myaln_Assignment_L5.fasta", format = "fasta")

calculate the genetic distances between the protein sequences

mydist_L5 <- dist.alignment(myaln_L5)
mydist_L5
##           P12877    P08895    P41201    P62399    P14124
## P08895 0.1830835                                        
## P41201 0.3737175 0.3811186                              
## P62399 0.3584573 0.3584573 0.3883787                    
## P14124 0.5937711 0.5937711 0.5937711 0.5773503          
## P54040 0.5987408 0.6039701 0.6244494 0.6039701 0.4659859

get sequence annotations

unlist(getAnnot(seqs_L5))
## [1] "sp|P62399|RL5_ECOLI 50S ribosomal protein L5 OS=Escherichia coli (strain K12) OX=83333 GN=rplE PE=1 SV=2"                                                                  
## [2] "sp|P12877|RL5_BACSU 50S ribosomal protein L5 OS=Bacillus subtilis (strain 168) OX=224308 GN=rplE PE=1 SV=1"                                                                
## [3] "sp|P41201|RL5_THETH 50S ribosomal protein L5 OS=Thermus thermophilus OX=274 GN=rplE PE=1 SV=3"                                                                             
## [4] "sp|P08895|RL5_GEOSE 50S ribosomal protein L5 OS=Geobacillus stearothermophilus OX=1422 GN=rplE PE=1 SV=1"                                                                  
## [5] "sp|P14124|RL5_HALMA 50S ribosomal protein L5 OS=Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809) OX=272569 GN=rpl5 PE=1 SV=4"                 
## [6] "sp|P54040|RL5_METJA 50S ribosomal protein L5 OS=Methanocaldococcus jannaschii (strain ATCC 43067 / DSM 2661 / JAL-1 / JCM 10045 / NBRC 100440) OX=243232 GN=rpl5 PE=3 SV=1"

The L5 proteins of Bacillus subtilis (P12877) and Geobacillus stearothermophilus (P08895) are the most closely related: their genetic distance (0.183) is the smallest in the matrix.

Q2. Build an unrooted phylogenetic tree of the proteins, using the neighbour-joining algorithm. Which are the most closely related proteins, based on the tree?

# construct a phylogenetic tree with the neighbor joining algorithm
mytree_L5 <- nj(mydist_L5)
plot.phylo(mytree_L5, type="unrooted")

In the unrooted tree, the L5 proteins of Bacillus subtilis (P12877) and Geobacillus stearothermophilus (P08895) are again the most closely related.

Q3. Build a rooted phylogenetic tree of the proteins, using an outgroup. Which are the most closely related proteins, based on the tree? What extra information does this tree tell you, compared to the unrooted tree in Q2?

mytree_L5 <- root(mytree_L5, outgroup = "P54040", resolve.root = TRUE)
plot.phylo(mytree_L5, main = "Phylogenetic Tree")

In the rooted tree, the L5 proteins of Bacillus subtilis (P12877) and Geobacillus stearothermophilus (P08895) are still the most closely related. Compared with the unrooted tree, the rooted tree also shows the branching order relative to the outgroup: Escherichia coli (P62399) appears more closely related to Bacillus subtilis (P12877) and Geobacillus stearothermophilus (P08895) than Thermus thermophilus (P41201) is.
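The three Q3 answers differ in whether Escherichia coli or Thermus thermophilus sits closer to the Bacillus subtilis / Geobacillus stearothermophilus pair. As a rough cross-check, the sketch below compares mean raw distances copied from the three distance matrices above. This is a cruder criterion than the neighbour-joining tree, whose topology need not agree with it, and the names `to_bacilli`, `mean_dist` and `closer` are introduced here for illustration.

```r
# Distances to B. subtilis and G. stearothermophilus, copied from the
# mydist_S7, mydist_L2 and mydist_L5 outputs above
to_bacilli <- list(
  S7 = list(E_coli         = c(0.4160251, 0.4082483),
            T_thermophilus = c(0.4311582, 0.4082483)),
  L2 = list(E_coli         = c(0.4364358, 0.4364358),
            T_thermophilus = c(0.4045199, 0.3813850)),
  L5 = list(E_coli         = c(0.3584573, 0.3584573),
            T_thermophilus = c(0.3737175, 0.3811186))
)

# Mean distance per species, one row per protein
mean_dist <- t(sapply(to_bacilli, function(p) sapply(p, mean)))
print(round(mean_dist, 4))

# Species with the smaller mean distance for each protein
closer <- colnames(mean_dist)[apply(mean_dist, 1, which.min)]
names(closer) <- rownames(mean_dist)
print(closer)
```

For these values the smaller mean belongs to E. coli for S7 and L5 and to T. thermophilus for L2, which points the same way as the three Q3 answers, although the S7 margin is small.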

References

sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] ja_JP.UTF-8/ja_JP.UTF-8/ja_JP.UTF-8/C/ja_JP.UTF-8/ja_JP.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] ape_5.4             msa_1.20.0          Biostrings_2.56.0  
## [4] XVector_0.28.0      IRanges_2.22.2      S4Vectors_0.26.1   
## [7] BiocGenerics_0.34.0 seqinr_3.6-1       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6    knitr_1.28      magrittr_1.5    zlibbioc_1.34.0
##  [5] MASS_7.3-51.6   lattice_0.20-41 rlang_0.4.6     stringr_1.4.0  
##  [9] highr_0.8       tools_4.0.0     grid_4.0.0      nlme_3.1-148   
## [13] xfun_0.14       htmltools_0.4.0 yaml_2.2.1      ade4_1.7-15    
## [17] digest_0.6.25   crayon_1.3.4    evaluate_0.14   rmarkdown_2.2  
## [21] stringi_1.4.6   compiler_4.0.0